Cardiac disease is the leading cause of death worldwide, with 17.9 million deaths annually
according to World Health Organization reports. The purpose of this study is to enable a
cardiologist to predict a patient’s condition early, before performing the echocardiography
test. The study aims to determine, through machine learning applied to symptoms, whether a
patient has diastolic function or diastolic dysfunction. We used a previously unexplored dataset
of diastolic dysfunction and verified with a cardiologist that the symptoms were sufficient to
predict the disease. The records of 1285 patients were used, of which 524 patients had
diastolic function and the other 761 patients had diastolic dysfunction.
The input parameters were patient age, gender, BP systolic, BP diastolic, BSA, BMI,
hypertension, obesity, and Shortness of Breath (SOB). Several machine learning algorithms
were used for this detection: Random Forest, J.48, Logistic Regression, and Support Vector
Machine (SVM). Logistic Regression gave the best result, with an accuracy of 85.45%, and
proved efficient for early prediction of cardiac disease. The other algorithms achieved the
following accuracies: J.48 (85.21%), Random Forest (84.94%), and SVM (84.94%). Using a
machine learning tool and a patient dataset of diastolic dysfunction, we can predict whether
or not a patient has cardiac disease.
Introduction

Cardiac disease is the leading cause of death worldwide, with 17.9 million deaths
annually according to World Health Organization reports[1]. Due to lack of awareness and
unhealthy habits, risk factors for heart disease such as diabetes, obesity, and hypertension
are on the rise[2]. Heart disease can be predicted through several dysfunctions, such as
Systolic Dysfunction and Diastolic Dysfunction[3]. Systolic and diastolic function and
dysfunction are derived from the echocardiography test and the Doppler test, as previously
defined in[4]. Systolic and diastolic function are the two phases of heart pumping. The phase
in which the heart pumps blood out to the body is called systolic function, and the relaxing
phase in which the heart fills with blood returning from the body is called diastolic function.
Diastolic dysfunction is usually related to age: more than 50% of adults aged 70 years have
diastolic dysfunction[5]. Diastolic dysfunction accounts for nearly half of all heart failure
patients[6, 7]. In this study, patients’ basic data was collected from the private clinic of the
cardiac center "The Heart Center Bahawalpur". Machine learning algorithms were used to
predict the diastolic function or dysfunction conclusion reached by the cardiologist at this
private clinic after an echocardiography test.

The cardiovascular system (CVS) consists of the heart and blood vessels. A wide range
of dysfunctions can arise in the cardiovascular system. Such a dysfunction is considered a form
of cardiovascular disease or heart disease (HD)[8]. Heart disease can be predicted from some
common indicators such as age, BP, shortness of breath, hypertension, and obesity. Through
data mining or machine learning, it can be determined which features are more important
for predicting a specific disease. Machine learning (ML) is a systematic approach to building a
model and checking whether a set of features can predict a specific value.
Both supervised and unsupervised machine learning approaches have been used in many
studies to predict cardiac disease. Supervised machine learning is used when
the prediction class is known, whereas unsupervised machine learning is used when the
prediction class is unknown. Some studies predicted diastolic function with
supervised machine learning, and others used unsupervised learning to predict
diastolic dysfunction. Unsupervised machine learning was used in[9-11], which clustered
echocardiography test results to predict diastolic function. Another
study[12] also used unsupervised machine learning to predict diastolic function. Some studies
use a supervised approach to predict a known class such as the Duke treadmill score,
which is derived after a stress echocardiography test, whereas in the studies[13-16] predictions
were made from symptoms using supervised machine learning. Another example of a supervised
learning technique is the study[17], in which the dataset contained symptoms and the model
predicted whether or not the patient had heart disease.

Echocardiography and the Doppler test can detect diastolic function or dysfunction.
In a recent study on predicting diastolic dysfunction[18], the authors applied a machine
learning approach to the test data to obtain the best prediction over four categories: normal,
mild, moderate, and severe[19].

We want to predict a patient's condition from the basic features or symptoms mentioned
earlier in this section. For this purpose, we used a unique dataset that has not previously been
used in any other study. The data was taken from the local clinic of the cardiology center. The
data we received from the cardiology center was in SQL format, with the disease conclusion
written as a paragraph. The symptoms taken from patients were in text format. We processed
the data and brought it into a single-table format. After data processing, data mining algorithms
were applied to the dataset. The symptoms recorded by the cardiologist were used to predict
the specific disease, in coordination with the expert opinion of the cardiologist.
Table 1: Previous studies in terms of prediction

Ref    Author             Accuracy (%)   Technique
[20]   Shafenoor et al.   95.00          K-NN and Decision Tree
[21]   Devansh et al.     90.79          K-NN
[22]   Apurtb et al.      93.70          Random Forest
[23]   Archana et al.     87.00          K-NN
[17]   Mamun et al.       100            Random Forest
[24]   Kavitha et al.     88             Decision Tree & Random Forest
[25]   Norma et al.       98.40          HDPM
[26]   Paul et al.        80.00          Neural network with fuzzy logic
[27]   Verma et al.       80.68          Decision tree
[28]   Ismaeel et al.     86.50          Extreme Machine learning
[29]   El-Bialy et al.    78.54          Decision tree
[30]   Subanya et al.     86.76          Support Vector Machine
[31]   Nahar et al.       GOA            Naive Bayes
[32]   Tougui et al.      85.86          Artificial Neural Network
MATERIAL & METHODS 
Data Set 


The dataset used in this research was collected from "The Heart Center
Bahawalpur". It contains the records of 1285 patients, including echocardiography results, of
whom 788 are male and 497 are female. The age of male patients ranged from 31
to 100 years, and that of female patients from 30 to 95 years. The BSA of male patients
was between 0.77 and 2.92, and for female patients it was between 0.54 and 2.62. The BMI of
male patients was between 10.28 and 48.82, and for female patients it was between 13.84 and
49.58. Two further features are blood pressure (BP) systolic, between 48 and 210, and blood
pressure (BP) diastolic, between 19 and 180. These features come from the patients’ basic
information and initial checkup. The remaining features are risk factors. The first risk factor
was hypertension, found in 419 patients, of whom 215 were male and 204 were female.
Hypertension is a very common risk factor in patients these days, and it increases
mortality in heart disease patients[33, 34]. The second risk factor considered in our
experiments is obesity. Obesity is recorded as a risk factor when the patient's weight is above
the normal range for their age and height. 189 patients had the obesity risk factor,
of whom 85 were male and 104 were female. Obesity has considerable effects on diastolic
dysfunction, as described in[35, 36]. Shortness of Breath (SOB) starts when the veins of the
heart become thicker and blood cannot flow properly. The patient then has breathing
difficulty after light jogging or even walking. In our case, 426 patients had the SOB risk
factor, of whom 239 were male and 187 were female. Shortness of breath is the first sign used
to detect an issue in heart patients[37].
Figure 1: Interface for data collection of Cardiac Disease practically available at Heart Center
Bahawalpur
[Screenshot of the patient information form. Fields: Patient ID, Name, Age; B.P. (Systolic),
B.P. (Diastolic), Weight, Height, BSA, BMI; Primary Indication; Risk Factor (Diabetes Mellitus,
Hypertension, Smoking, Obesity, Family History, Ischemic Heart Disease, Old Age, Post Met);
Observations; History Notes; Misc.]

Figure 3: Patient test conclusion part
[Screenshot of the conclusion field. Example entry: "Mild Concentric LVH. Normal Bi Ventricular
systolic function. Diastolic dysfunction grade I. Moderate Pulmonary Hypertension."]
Table 2: Attribute detail
Attribute      Detail                                              Instances   Male          Female
Gender         Patient is male or female; binary (Male: 1;         1285        788           497
               Female: 0)
Age            Patient's age (in years)                            1285        31-100        30-95
BP Systolic    Patient's upper-bound BP                            1285        59-210        48-210
BP Diastolic   Patient's lower-bound BP                            1285        19-180        30-154
BSA            Patient's BSA                                       1285        0.77-2.92     0.54-2.62
BMI            Patient's BMI                                       1285        10.28-48.82   13.84-49.58
Hypertension   Risk factor; binary in the dataset (1 or 0)         419         215           204
Obesity        Binary feature (1 or 0)                             189         85            104
SOB            Risk factor; binary (1 or 0)                        426         239           187


Data Preparation 

We used the WEKA tool for the experiments. Initially, the data was provided in SQL database
format, in which risk factors and primary indications were multi-valued fields. The class
variable was marked manually with two values, diastolic function and diastolic dysfunction,
based on the echocardiography test conclusion provided by the cardiologist. With the
above-mentioned features, risk factors, and primary indications, we tried to predict
patients’ diastolic function without performing echocardiography.
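
The conversion of a multi-valued risk-factor field into binary columns can be sketched as follows. This is a minimal illustration and not the preprocessing actually used in the study: the comma-separated layout of the input string and the helper name `encode_risk_factors` are assumptions made for the example, and only the three column names are taken from Table 2.

```python
# Hypothetical sketch: turn a comma-separated risk-factor string into 0/1 columns.
# The column names mirror Table 2; the input layout is an assumption.
RISK_COLUMNS = ["Hypertension", "Obesity", "SOB"]


def encode_risk_factors(raw_text):
    """Return a dict with one binary value per risk factor of interest."""
    # Normalize each comma-separated entry for case-insensitive matching.
    present = {item.strip().lower() for item in raw_text.split(",") if item.strip()}
    return {name: int(name.lower() in present) for name in RISK_COLUMNS}


row = encode_risk_factors("Diabetes Mellitus, Hypertension, Obesity")
print(row)
```

Entries that are not among the selected columns (here, Diabetes Mellitus) are simply ignored, which matches the idea of keeping only the features listed in Table 2.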
Logistic Regression Algorithm. The Logistic Regression (LR) classifier is a supervised
learning algorithm based on statistical techniques. LR creates a
model that relates the input variables to the output variable. LR is suited to a binary outcome:
the class is the dependent variable and the features are the independent variables. Logistic
regression is used in many healthcare studies for classification purposes[38, 39].
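
To make the idea concrete, the sketch below shows how a fitted logistic regression model turns a patient record into a class probability. The weights, the bias, and the choice of three features are hypothetical values picked for illustration; they are not coefficients estimated from the study data.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """Probability of the positive class for one patient record."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)


# Hypothetical weights for (age, hypertension, SOB); not fitted to the study data.
weights = [0.05, 0.8, 0.6]
bias = -3.5
patient = [70, 1, 1]  # 70 years old, hypertensive, short of breath

p = predict_proba(patient, weights, bias)
label = "diastolic dysfunction" if p >= 0.5 else "diastolic function"
print(round(p, 3), label)
```

The 0.5 threshold is the usual default for a two-class problem; a clinical setting might prefer a different cut-off depending on the cost of each kind of error.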

SVM. Support Vector Machine (SVM) is a well-known supervised classification algorithm
that is useful in data analysis and pattern recognition. SVM is a mathematically based
algorithm for creating a model for data analysis. The SVM algorithm is widely used in machine
learning studies for better classification [40-42].

J.48 Algorithm. The J.48 algorithm, a decision tree classifier, is widely used in medical data
analysis. It has been used in many previous studies to predict disease from symptoms[43, 44].

Random Forest. Random Forest (RF) is a supervised algorithm widely used in data
science for classification purposes. RF builds many decision trees over the features and
combines their votes for decision-making [45-47].
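
The voting step of a Random Forest can be illustrated with a toy example. The three hand-written rules below are hypothetical stand-ins for trained trees; a real Random Forest learns each tree from a bootstrap sample of the data with random feature subsets, which this sketch does not do.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class label predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical one-rule "trees"; the thresholds are illustrative only.
def stump_age(p):
    return "dysfunction" if p["age"] >= 60 else "function"


def stump_hypertension(p):
    return "dysfunction" if p["hypertension"] == 1 else "function"


def stump_sob(p):
    return "dysfunction" if p["sob"] == 1 else "function"


forest = [stump_age, stump_hypertension, stump_sob]
patient = {"age": 72, "hypertension": 0, "sob": 1}

votes = [tree(patient) for tree in forest]
print(votes, "->", majority_vote(votes))
```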


Figure 4. Diastolic function/dysfunction (gender-wise)
Process Flow 
We use the following sequence to conduct this study. 


Results 


Two evaluation techniques were used: split percentage and cross-validation. The split
percentage technique divides the data into training and testing sets according to a given
percentage. The training data is used to build the model, and the test data is used to apply
the already built model and obtain the accuracy, to check whether the model is suitable for
further classification. Cross-validation divides the data into folds and tests on each fold in
turn. We applied the "Replace Missing Values" filter to the data before applying any algorithm.
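
The two evaluation schemes can be sketched in plain Python as index-splitting routines. This is an illustration under assumptions, not a reproduction of WEKA's procedure: the function names, the shuffling, and the seeds are chosen for the example, and WEKA's own fold assignment may differ (for example, by stratifying on the class).

```python
import random


def percentage_split(n_records, train_percent, seed):
    """Shuffle record indices and split them into train and test lists."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Integer arithmetic with round-half-up avoids floating-point surprises.
    cut = (n_records * train_percent + 50) // 100
    return indices[:cut], indices[cut:]


def k_fold_indices(n_records, k, seed):
    """Yield (train, test) index lists; each record is tested exactly once."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


train, test = percentage_split(1285, 70, seed=1)
print(len(train), len(test))

for fold_no, (tr, te) in enumerate(k_fold_indices(1285, 5, seed=3), start=1):
    print(fold_no, len(tr), len(te))
```

With 1285 records, a 70% split leaves 385 records for testing, and 5-fold cross-validation tests 257 records per fold, so every record contributes to the cross-validated scores.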

The results are reported in terms of “Precision”, “Recall”, “F-Measure”, “MCC” and “ROC
Area”.

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP+FN) 

F-Measure = 2 * Precision * Recall / (Precision + Recall) [48] 

ROC Area is interpreted using the following performance guide for the classification accuracy
of a diagnostic test: Excellent (90-100), Good (80-89), Fair (70-79), Poor (60-69), Fail (50-59).
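
The grading bands above translate directly into a small lookup. The sketch below assumes that values falling between two quoted bands (for example 89.5) belong to the lower band and that anything under 60 counts as "Fail"; neither detail is stated in the text.

```python
def roc_grade(roc_area_percent):
    """Map an ROC Area (in percent) to the qualitative band quoted in the text."""
    if roc_area_percent >= 90:
        return "Excellent"
    if roc_area_percent >= 80:
        return "Good"
    if roc_area_percent >= 70:
        return "Fair"
    if roc_area_percent >= 60:
        return "Poor"
    return "Fail"


# ROC Area values reported for the split-percentage runs of LR and SVM.
for name, area in [("LR", 93.9), ("SVM", 85.3)]:
    print(name, roc_grade(area))
```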

TP = True Positive TN = True Negative 

FP = False Positive FN = False Negative 
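
As a worked check of these definitions, the sketch below computes the scores from the Random Forest split-percentage counts, taking diastolic dysfunction as the positive class and reading the counts as 196 true positives, 36 false negatives, 22 false positives, and 131 true negatives. That reading is an interpretation of the reported numbers. The MCC formula is the standard one for a two-class confusion matrix; it is not spelled out in the text.

```python
import math


def binary_metrics(tp, fn, fp, tn):
    """Compute accuracy, precision, recall, F-measure and MCC from counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Matthews correlation coefficient for a 2x2 confusion matrix.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "mcc": mcc,
    }


# Random Forest, 70/30 split, diastolic dysfunction as the positive class.
m = binary_metrics(tp=196, fn=36, fp=22, tn=131)
print({name: round(100 * value, 2) for name, value in m.items()})
```

Rounded to one decimal place, these values agree with the D-Dysfunction row reported for Random Forest with split percentage (89.9% precision, 84.5% recall, 87.1% F-Measure, 69.2% MCC) and with the overall accuracy of 84.94%.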

Random Forest (RF) 

The Random Forest algorithm was evaluated with a 70%/30% split of the data for training
and testing, respectively. The results are reported for each class
separately. Of the patients with diastolic dysfunction, 196 were correctly identified and 36
were incorrectly identified as having diastolic function. Of the patients with diastolic
function, 131 were correctly identified and 22 were incorrectly identified as having diastolic
dysfunction. The overall accuracy of this algorithm is 84.94%. The table below shows
the detailed accuracy by class:

Table 3: Random Forest with split percentage
RF evaluation scores in terms of classes
                Precision   Recall   F-Measure   MCC     ROC Area
D-Dysfunction   89.9%       84.5%    87.1%       69.2%   92.7%
D-Function      78.4%       85.6%    81.9%       69.2%   92.7%

The RF algorithm was also tested with 10-fold cross-validation and seed value 2. Of the
761 patients with diastolic dysfunction, 633 were correctly identified and 128 were incorrectly
identified as diastolic function; of the 524 patients with diastolic function, 412 were correctly
identified and 112 were incorrectly identified as diastolic dysfunction. With cross-validation,
the RF algorithm correctly identified 1045 of the 1285 patients, an overall accuracy of
81.32%. The table below shows the detailed accuracy
by class:


Table 4: Random Forest with cross-validation
RF evaluation scores in terms of classes
                Precision   Recall   F-Measure   MCC     ROC Area
D-Dysfunction   85.0%       83.2%    84.1%       61.5%   90.6%
D-Function      76.3%       78.6%    77.4%       61.5%   90.6%


J.48. 

The J.48 supervised classifier was evaluated with an 80%/20% split of the data for training
and testing, respectively. The results are reported for each class separately. Of the
patients with diastolic dysfunction, 134 were correctly identified and 24 were incorrectly
identified as having diastolic function. Of the patients with diastolic function, 85 were
correctly identified and 14 were incorrectly identified as having diastolic dysfunction.
The overall accuracy for the J.48 algorithm with split percentage is
85.21%. The table below shows the detailed accuracy by class:
Table 5: J.48 with split percentage
J.48 evaluation scores in terms of classes
                Precision   Recall   F-Measure   MCC     ROC Area
D-Dysfunction   90.5%       84.8%    87.6%       69.6%   91.20%
D-Function      78.0%       85.9%    81.7%       69.6%   91.20%


The J.48 algorithm was also tested using 5-fold cross-validation. Of the 761 patients with
diastolic dysfunction, 633 were correctly identified and 128 were incorrectly identified
as diastolic function; of the 524 patients with diastolic function, 432 were correctly
identified and 92 were incorrectly identified as having diastolic dysfunction. The overall
accuracy for this algorithm is 82.88%. The table below shows the detailed accuracy by class:

Table 6: J.48 with cross-validation
J.48 evaluation scores in terms of classes
                Precision   Recall   F-Measure   MCC   ROC Area
D-Dysfunction   87.3%       83.2%    85.2%       65%   88.3%
D-Function      77.1%       82.4%    79.7%       65%   88.3%


Logistic Regression 

The Logistic Regression supervised classifier was evaluated with a 70%/30% split of the data
for training and testing, respectively. The results are reported for each class separately.
Of the patients with diastolic dysfunction, 194 were correctly identified and 38 were
incorrectly identified as diastolic function. Of the patients with diastolic function, 135 were
correctly identified and 18 were incorrectly identified as diastolic dysfunction. The overall
accuracy for the LR algorithm is 85.45%. The table below shows the
detailed accuracy by class:

Table 7: Logistic Regression with split percentage
LR evaluation scores in terms of classes
                Precision   Recall   F-Measure   MCC     ROC Area
D-Dysfunction   91.5%       83.6%    87.4%       70.7%   93.9%
D-Function      78.0%       88.2%    82.8%       70.7%   93.9%


The LR algorithm was also tested using 5-fold cross-validation. Of the 761 patients with
diastolic dysfunction, 621 were correctly identified and 140 were incorrectly identified as
diastolic function. Of the 524 patients with diastolic function, 448 were correctly identified
and 76 were incorrectly identified as diastolic dysfunction. The overall accuracy for this
algorithm is 83.19%. The table below shows the detailed accuracy by class:

Table 8: Logistic Regression with cross-validation
LR evaluation scores in terms of classes
                Precision   Recall   F-Measure   MCC     ROC Area
D-Dysfunction   89.1%       81.6%    85.2%       66.2%   91.6%
D-Function      76.2%       85.5%    80.6%       66.2%   91.6%


Support Vector Machine. The SVM supervised classifier was evaluated with a 70%/30% split of
the data for training and testing, respectively. The results are reported for each class
separately. Of the patients with diastolic dysfunction, 194 were correctly identified and 38
were incorrectly identified as diastolic function. Of the patients with diastolic function,
133 were correctly identified and 20 were incorrectly identified as diastolic dysfunction.
The overall accuracy for the SVM algorithm is 84.94%. The
table below shows the detailed accuracy by class:
Table 9: Support Vector Machine with split percentage
SVM evaluation scores in terms of classes
                Precision   Recall   F-Measure   MCC     ROC Area
D-Dysfunction   90.7%       83.6%    87.0%       69.5%   85.3%
D-Function      77.8%       86.9%    82.1%       69.5%   85.3%
The SVM algorithm was also tested using 5-fold cross-validation with random seed 3.
Of the 761 patients with diastolic dysfunction, 618 were correctly identified and 143 were
incorrectly identified as diastolic function. Of the 524 patients with diastolic function, 444
were correctly identified and 80 were incorrectly identified as diastolic dysfunction. The
overall accuracy for this algorithm is 82.65%. The table below shows the detailed accuracy by
class:
Table 10: Support Vector Machine with cross-validation
SVM evaluation scores in terms of classes
                Precision   Recall   F-Measure   MCC     ROC Area
D-Dysfunction   88.5%       81.2%    84.7%       65.1%   83.0%
D-Function      75.6%       84.7%    79.9%       65.1%   83.0%
DISCUSSION 


Using both techniques, split percentage and cross-validation, the best performance of each
algorithm is listed below in Table 11 and Table 12, respectively. All algorithms produce
approximately equal results, but the highest accuracy, 85.45%, comes from the Logistic
Regression algorithm using split percentage. We compare the results in two tables: the first
gives the detailed accuracy of each algorithm with split percentage, and the second gives the
detailed accuracy of each algorithm with cross-validation.

Table 11: All algorithms accuracy by split percentage
       Accuracy   Precision   Recall   F-Measure   MCC    ROC Area
RF     84.94      85.4        84.9     85          69.2   92.7
J.48   85.21      85.7        85.2     85.3        69.6   91.2
LR     85.45      86.2        85.5     85.6        70.7   93.9
SVM    84.94      85.5        84.9     85          69.5   85.3
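
The per-algorithm Precision, Recall, and F-Measure values in Table 11 appear to be class-weighted averages of the per-class scores. This is an inference from the numbers, not something the text states. The sketch below checks it for the Random Forest precision, using the split-percentage counts and weighting each class by its number of test records; the helper name `weighted_average` is chosen for the example.

```python
def weighted_average(values, supports):
    """Average per-class scores, weighting each class by its number of records."""
    total = sum(supports)
    return sum(v * s for v, s in zip(values, supports)) / total


# Random Forest, 70/30 split: per-class precision from the confusion counts.
precision_dysfunction = 196 / (196 + 22)
precision_function = 131 / (131 + 36)
supports = [196 + 36, 131 + 22]  # actual dysfunction, actual function

avg = weighted_average([precision_dysfunction, precision_function], supports)
print(round(100 * avg, 1))  # 85.4
```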
Table 12: All algorithm accuracy with cross-validation
[Chart with data table; only the following values are legible]
      Accuracy   Precision   Recall
RF    81.32      81.4        81.3
LR    83.19      83.8        83.2
The statistics of each algorithm are shown in the tables below. Table 13 shows the statistics
of each algorithm with the split percentage technique, and Table 14 shows the statistics of
each algorithm with cross-validation:


Table 13: Algorithm stats with split percentage
All classifiers’ stats with split percentage technique
       Kappa Statistic   Mean Absolute Error
RF     0.6903            0.2101
J.48   0.6936            0.2162
LR     0.7029            0.2107
SVM    0.6916            0.1506
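
The Kappa Statistic can be recomputed from a two-class confusion matrix with the standard Cohen's kappa formula, which compares the observed agreement with the agreement expected by chance. The sketch below uses the split-percentage counts reported in the Results section, with diastolic dysfunction as the positive class; the function name is chosen for the example.

```python
def cohen_kappa(tp, fn, fp, tn):
    """Cohen's kappa for a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Agreement expected by chance from the row and column totals.
    expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / (n * n)
    return (observed - expected) / (1 - expected)


# Split-percentage counts for Random Forest and Logistic Regression.
print("RF", round(cohen_kappa(tp=196, fn=36, fp=22, tn=131), 4))
print("LR", round(cohen_kappa(tp=194, fn=38, fp=18, tn=135), 4))
```

Both results match the Kappa Statistic values listed for RF (0.6903) and LR (0.7029) with split percentage.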


Table 14: Algorithm stats with cross-validation
All classifiers’ stats with cross validation technique
       Kappa Statistic   Mean Absolute Error
RF     0.6151            0.2269
J.48   0.6493            0.2291
LR     0.6585            0.2274
SVM    0.6473            0.1735
CONCLUSION 


This work has analyzed the role of machine learning in the medical field for disease
prediction (diastolic function or dysfunction) before performing the medical test. In this study,
we took the records of echocardiography tests, each of which has some features and, finally, a
conclusion given by the cardiologist. Using a machine learning approach, we
checked whether the patient's described symptoms are related to the resulting disease. In
this process, we first obtained the data from the cardiologist, processed it, and brought it into
a usable form. The features comprise basic patient data, such as age, gender, BP, BSA, and BMI,
and the patient's description of symptoms or history, such as hypertension, obesity, and
shortness of breath. We converted these features into columns and replaced the categorical
values with binary values. Then, machine learning algorithms were used to examine the
relationship between the features and their outcomes. After applying machine learning to the
patients’ symptoms and the conclusion data for the diastolic function and diastolic dysfunction
classification, the logistic regression algorithm gave the best accuracy, about 85%. This
suggests that the symptoms a cardiologist asks patients about are about 85% sufficient to reach
the conclusion. In further studies, additional features and other algorithms may increase this
accuracy, which would benefit cardiologists and save time.